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Abstract 

This paper evaluates heterogeneous information fusion using multi-task Gaussian processes 
in the context of geological resource modeling. Specifically, it empirically demonstrates 
that information integration across heterogeneous information sources leads to superior 
estimates of all the quantities being modeled, compared to modeling them individually. 
Multi-task Gaussian processes provide a powerful approach for simultaneous modeling of 
multiple quantities of interest while taking correlations between these quantities into con- 
sideration. Experiments are performed on large scale real sensor data. 
Keywords: Gaussian process, Information fusion, Geological resource modeling 



1. Introduction 

In applications such as space-exploration, mining or agriculture automation, modeling the 
underlying resource is a fundamental problem. For such applications, an efficient, flexible 
and high-fidelity representation of the geology is critical. The key challenges in realizing this 
are that of dealing with the problems of uncertainty and incompleteness. Uncertainty and 
incompleteness are virtually ubiquitous in any sensor based application as sensor capabilities 
are limited. The problem is magnified in a field automation scenario due to sheer scale of 
the application. Incompleteness is a major problem in any large scale resource modeling 
endeavor as sensors have limited range and applicability. A more significant contributor to 
this issue is that of cost - sampling and collecting such data is expensive. Geological data is 
typically collected through various sensors/processes of widely differing characteristics and 
consequently lead to different kinds of information. Often the resource is characterized by 
numerous quantities (for example, soil composition in terms of numerous elements). These 
quantities often are correlated. 

Given these issues, large scale geological resource modeling needs a representation that 
can handle spatially correlated, incomplete and uncertain data. Not only must the corre- 
lation between homogeneous quantities be modeled but also that between heterogeneous 
quantities. This paper uses a Gaussian process (GP) representation of resource data similar 



to that described in Vasudevan et al. (2009). GPs are ideally suited to handling spatially 
correlated data. This paper further uses an extension of the basic Gaussian process model, 
the multi-task Gaussian process (MTGP), to simultaneously model multiple quantities of 
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interest. The proposed model not only captures spatial correlations between individual 
quantities with themselves (at different locations) but also that between totally different 
quantities that together quantify the resource. That the quantities modeled in this paper 
exhibit strong correlation is known from geological sciences. This paper presents an empir- 
ical evaluation to understand (1) if simultaneous modeling of multiple quantities of interest 
(i.e. modeling and using the correlations between them and hence performing data fusion) 
is better than modeling these quantities independently and (2) if the nonstationary kernels 
are more effective than stationary kernels at modeling geological data. Experiments are 
performed on large scale real sensor data. 



2. Related work 



Gaussian processes (GPs) (Rasmussen and Williams 2006) are powerful non-parametric 



Bayesian learning techniques that can handle correlated, uncertain and incomplete data. 
They have been used in a range of fields, the Gaussian process web-sit^ lists several ex- 
amples. GPs produce a scalable multi-resolution model of the entity under consideration. 
They yield a continuous domain representation of the data and hence can be sampled at any 
desired resolution. GPs incorporate and handle uncertainty in a statistically sound manner 
and represent spatially correlated data appropriately. They model and use the spatial cor- 
relation of the given data to estimate the values for other unknown points of interest. GPs 



basically perform a standard interpolation technique known as Kriging (Matheron, 1963) 



The work (Vasudevan et al. , 2009), performed large scale terrain modeling with GPs. 



It proposed the use of non-stationary kernels (neural network) to model large scale dis- 
continuous spatial data. A performance comparison between GPs based on stationary 
(squared exponential) and non-stationary (neural network) kernels as well as several other 
standard interpolation methods applicable to alternative representations of terrain data, 
was reported. The non-stationary neural network kernel was found to be superior to the 
stationary squared exponential kernel and at least as good as most standard interpolation 
techniques for a range of terrain (in terms of sparsity/complexity /discontinuities). The work 
presented in this paper builds on this GP representation. However, it addresses the prob- 
lem of simultaneous modeling multiple heterogeneous quantities of interest, in the context 
of geological resource modeling. This requires the modeling and usage of the correlations 
between these quantities towards improving predictions of each of them - an instance of 
data fusion using Gaussian processes. 

Data fusion in the context of Gaussian processes is necessitated by the presence of 
multiple, multi-sensor, multi-attribute, incomplete and/or uncertain data sets of the entity 
being modeled. Two preliminary attempts towards addressing this problem include (El- 



Beltagy and Wright, 2001) and (Murray-Smith and Pearlmutter, 2005). The former bears a 



"hierarchical learning" flavor to it in that it demonstrates how a GP can be used to model an 
expensive process by (a) modeling a GP on an approximate or cheap process and (b) using 
the many input-output data from the approximate process and the few samples available of 
the expensive process together in order to learn a GP for the latter. The latter work attempts 
to generalize arbitrary transformations on GP priors through linear transformations. It hints 
at how this framework could be used to introduce heteroscedasticity (random variables with 
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non-constant variance) and how information from different sources could be fused. However, 
specifics on how the fusion can actually be performed are beyond the scope of the work. 



Girolami in (Girolami, 2006) integrated heterogeneous feature types within a Gaussian 



process classification setting, in a protein fold recognition application domain. Each feature 
representation is represented by a separate GP. The fusion uses the idea that individual 
feature representations are considered independent and hence a composite covariance func- 
tion would be defined in terms of a linear sum of Gaussian process priors. A recent work 
by Reece et al. (Reece et al. , 2011) integrated "hard" data obtained from sensors with 



"soft" information obtained from human sources within a Gaussian process classification 
framework. This problem/approach is different from the work presented here. It uses het- 
erogeneous information domains (i.e. kinds of information) as mutually independent sources 
of information that are transformed into the kernel representation (a kernel for each kind 
of information) and combined using a product rule (a linear sum in Girolami's work). The 
focus thus, is on encoding or representing different kinds of information in a common math- 
ematical framework using kernels. This paper is concerned with a "higher level" data fusion 
problem of heterogeneous-source information integration after it has been represented us- 
ing kernel methods. The experiments of this paper demonstrate the case when information 
from each source is itself from a homogeneous domain - e.g. the heterogeneous input data 
are all real numbers. The approach presented in this paper improves the estimate of several 
different quantities being simultaneously modeled by explicitly modeling the correlation be- 
tween multiple heterogeneous information sources. If this is not the case (e.g. input data is 
made up of qualitative and quantitative data dimensions), each of heterogeneous informa- 
tion types can be represented by separate kernels and these can be combined using a sum 



or product as has been done in (Girolami 2006 Reece et al. , 2011). Simpler data fusion 



approaches demonstrated in (Vasudevan 2012) may be applied. However, the application 
of the approach presented in this paper (based on multi-output or multi-task GPs) will 
require a non-trivial derivation of auto and cross covariance terms for kernels applied on 
heterogeneous information types. 

Examples of related works that use multiple sources of the same kind of information 



within a single GP representation framework include (Thompson and Wettergreen, 2008) 



and (Dragiev et al. , 2011 ). Whereas the former uses single output GPs to incorporate in-situ 



surface spectra information and remotely sensed spectra information into a kilometer scale 
map of the environment, the latter uses a GP implicit surface representation of an object 
that has to be grasped and manipulated. The representation incorporates visual, haptic and 
laser data into a single representation of the object. Data from each of these sensor modal- 
ities conditions the GP prior based on the implied surface at that point (on/outside/inside 
the object). 

Two recent approaches demonstrating data fusion with Gaussian processes in the con- 



text of large scale terrain modeling were based on heteroscedastic GPs (Vasudevan et al. 



2010a') and dependent GPs (Vasudevan et al. , 2010b, 2011). These address the problem of 



fusing multiple, multi-sensor data sets of a single quantity of interest. This paper describes 
the framework for extending this concept to multiple heterogeneous quantities of interest. 



The work (Vasudevan et al. , 2010a) treated the data- fusion problem as one of combining dif- 



ferent noisy samples of a common entity (terrain) being modeled. In the Machine Learning 



community, this idea is referred to as heteroscedastic GPs ( Goldberg et al. , 1998 Yuan and 
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Wabha, 2004 Le et al., 2005 Kersting et al. 2007). The works (Vasudevan et al., 2010b) 



and (Vasudevan et al. , 2011) treated the data fusion problem as one of improving GP re- 
gression through modeling the spatial correlations (auto and cross covariances) between 
several dependent GPs representing the respective data sets. This idea has been inspired 
by recent machine learning contributions in multi-task or multi-output or dependent GP 



on (Higdon, 2002). In Kriging terminology, this idea is akin to Co-kriging (Wackernagel 



on (Higdon 


2003 


)• 



modeling including (Bonilla et al. , 2007) and (Boyle and Prean, 2004), the latter being based 



The work (Melkumyan and Ramos, 2011) provided preliminary findings to geological 
resource modeling using various combinations of stationary kernel including the squared ex- 



ponential (SQEXP), Matern 3/2 and a sparse covariance function (Melkumyan and Ramos 



2009). For a geological resource modeling data set taken from a mine, it found the Matern 
3/2 - Matern 3/2 - SQEXP kernel combination provided best performance in terms of the 
prediction error. This paper reports a detailed benchmarking experiment, using cross val- 
idation methods, performed on a multi-task GP, an equivalent set of GPs and a set of 
independently optimized GPs, to provide for an exact and an independent comparison be- 
tween them. The objective is to quantify the benefit (if any) of simultaneous modeling of 
the multiple quantities by modeling and using the correlations between them as against 
modeling each of these quantities separately. This paper also compares data fusion using 
multiple stationary and nonstationary kernels in the context of modeling geological data. 
An extensive review of kernel meth ods applied in mode ling vector valued functions 



was presented in a recent survey paper (Alvarez et al. , 2012). The paper discusses differ 



ent approaches to develop kernels for multi-task applications and draws parallels between 
regularization perspective of this problem and a Bayesian one. The latter perspective is 
discussed through Gaussian pro cesses. The work pre sented in this paper focuses on one 



of the approaches reviewed in (Alvarez et al. , 2012); specifically, it addresses modeling 



and information fusion of multi-task geological data using Gaussian processes developed 
using the process convolution approach. The paper presents a detailed empirical study of 
the approach applied to a large scale real world problem in order to evaluate its efficacy 
for information fusion, to understand the modeling capabilities of different kernels (chosen 
apriori) with such data and to understand broader approach-related questions from an ap- 
plication perspective. The paper also ties together past works of the authors within the 
process convolution theme. 



3. Approach 

3.1 Gaussian processes 



Gaussian processes (Rasmussen and Williams 2006) (GPs) are stochastic processes wherein 



any finite subset of random variables is jointly Gaussian distributed. They may be thought 
of as a Gaussian probability distribution in function space. They are characterized by a 
mean function m(x) and the covariance function k{x, x') that together specify a distribution 
over functions. In the context of geological resource modeling, each x = (east, north, depth) 
(3D coordinates) and /(x) = z, the concentration of the quantity being modeled. Although 
not necessary, the mean function m(x) may be assumed to be zero by scaling/shifting the 
data appropriately such that it has an empirical mean of zero. 
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The covariance function or kernel models the relationship between the random variables 



corresponding to the given data. It can take numerous forms (Rasmussen and Williams 



2006, chap. 4). The stationary squared exponential (or Gaussian) kernel (SQEXP) is given 



by 



fc5Q£;xp(x,x',i;) = aj. exp ^-^(x - x')^i;(x - x')^ 



(1) 



where k is the covariance function or kernel; S = diag[ least , Inorth j Idepth is a d x d 
diagonal length-scale matrix (d = dimensionality of input = 3 in this case), a measure of 
how quickly the modeled function changes in the east, north and depth directions; cr^ is 
the signal variance. The set of parameters { least i Inorth i Idepth , (^f } are referred to as 



the kernel hyperparameters. The non-stationary neural network (NN) kernel (Neal, 1996 
Williams, 1998a||b ) takes the form 

2 . / 2i^Sx' \ 



k 



NN 



x, x',S) 



o"? . — arcsin 



V(l + 2i^Si)(l + 2i'^Sx') 



(2) 



where x and x' are augmented input vectors (each point is augmented with a 1), S = 
diag\ j3 , least , Inorth , Idepth IS the (d + 1) x (d + 1) diagonal length-scale matrix (as 
before) with /3 being a bias factor and d being the dimensionality of the input data. The 
variables { f3 , least , Inorth 1 Idepth , (^f } Constitute the kernel hyperparameters. The 
NN kernel represents the covariance function of a neural network with a single hidden layer 
between the input and output, infinitely many hidden nodes and using a Sigmoidal transfer 
function (Williams, 1998a) for the hidden nodes. Hornik, in (Hornik, 1993), showed that 
such neural networks are universal approximators and Neal, in (Neal, 1996), observed that 
the functions produced by such a network would tend to a Gaussian process. Prior work in 
(Vasudevan et al. , 2009) found the NN kernel to be more effective than the SQEXP kernel 
at modeling discontinuous data. The Matern 3/2 kernel is another stationary kernel differ- 
ing from the SQEXP kernel in that the latter is infinitely differentiable and consequently 
tends to have a strong smoothing nature, which is argued as being detrimental to modeling 
physical processes ( Rasmussen and Williams , 2006 ) . It takes the form specified in Equation 



m 



kMATERN3{x,x',T,) 



n 



.1 H — )exp 



where feel, 
this case), S 



\<k<d 

d is the dimension of the input data [d 



Ik 



(3) 



dimensionality of input = 3 in 



east 



north 



depth 



] is a 1 X (i length-scale matrix, a measure of how 



quickly the modeled function changes in the east, north and depth directions; a'j is the 
signal variance. The set of parameters { least , Inorth ) ^-depth > ^/ } are referred to as the 
kernel hyperparameters. 

Regression using GPs uses the fact that any finite set of training (evaluation) data and 
test data of a GP are jointly Gaussian distributed. Assuming noise free data, this idea is 
shown in Expression [4] (hereafter referred to as Equation [4]) . This leads to the standard 
GP regression equations yielding an estimate (the mean value, given by Equation [s]) and 
its uncertainty (Equation^. 



z 



N[0, 



K{X,X) K{X,X^] 
K{X^,X) K{X^,X^ 



(4) 
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f,=K{X„X) KiX,X)-' z 
cov(/,) = K{X^,X^) - K{X^,X)K{X,X)-^K{X,X^ 



(5) 
(6) 



For n training points {X,z) = (xi, Zj)j=i...„ and n^: test points {X^:,f^:), K{X,X^) 
denotes the n x matrix of covariances evaluated at all pairs of training and test points. 
The terms K[X,X), K{X^,, X^) and K[X^,,X) are defined likewise. In the event that 
the data being modeled is noisy, a noise hyperparameter (cr) is also learnt with the other 
GP hyperparameters and the covariance matrix of the training data K{X, X) is replaced 
by [K{X, X) + cT^I] in Equations [4| [s] and [sj GP hyperparameters may be learnt using 



various techniques such as cross validation based approaches (Rasmussen and Williams 



2006) and maximum-a-posteriori approaches using Markov Chain Monte Carlo techniques 



( Rasmussen and Williams , 2006 Williams 1998b') and maximizing the marginal likelihood 



of the observed training data ( Rasmussen and Williams , 2006 Vasudevan et al. 2009 ) . This 



paper adopts the latter most approach based on the intuition that it may be more suited 
for large data sets. The marginal likelihood to be maximized is described in Equation [7j 



log p{z\X, ( 



z^K{X, Xy'z - - log \K{X, X 



1 



n 



log(2^) 



(7) 



3.2 Multi-task Gaussian processes (MTGPs) 

The problem being addressed in this paper can be described as follows. The objective 
is to model multiple heterogeneous quantities (e.g. concentrations of various elements) of 
the entity in consideration (e.g. land mass). The data fusion aspect of this problem is 
the improved estimation of each one of these quantities by integration or use of all other 
quantities of interest. If each quantity was modeled using a separate GP, the objective is 
to improve one CPs prediction estimates given all other GP models. 

Multi-task Gaussian processes (MTGPs or multi-output CPs or Dependent GPs) ex- 
tend Gaussian processes to handle multiple correlated outputs simultaneously. The main 
advantage of this technique is that the model exploits not only the spatial correlation of 
data corresponding to one output but also those of the other outputs. This improves GP 
regression/prediction of an output given the others, thus performing data fusion. Figure [l] 
shows a simulated example of this concept. 

Let the number of outputs/tasks that need to be simultaneously modeled be denoted 
by nt. Equations |4j [5] and [6] represent respectively the MTGP data fusion model, the 
regression estimates and their uncertainties, subject to the following modifications to the 
basic notation. The set 



[ Zi , Z2 , Z3 



Znt 



represents the output values of the selected training data from the individual nt tasks that 
need to be simultaneously modeled. The term 



X — [Xi , X2 , X'i 



nt 



denotes the input location values (east, north, depth) of the selected training data from the 
individual data sets. Any kernel (Rasmussen and Williams, 2006) may be used and even 
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1 Z 3 4 5 6 



Figure 1: A simple demonstration of the MTGP/DGP concept demonstrating data fusion. 

Two sine waves (black) are to be modeled. One is an inverted function of the 
other. Noisy samples are available all over one of them (red) whereas the other 
one has noisy samples only in one part of it (green) . Merely using these few green 
samples would result in a poor prediction of the sine wave in the areas devoid 
of samples. Using the spatial correlation with the red sampled sine wave enables 
the MTGP approach to improve the prediction of the green sampled sine wave. 
The figure above shows the predictions of the CPs given the other GP (red/blue 
circles) and that of the second GP taken alone (green plus marks). The figure 
below shows the uncertainty in predictions (error bars of two standard deviations 
about mean) of the second GP taken alone (green) and that when taken together 
with the first GP (blue) - a clear reduction in uncertainty is observed. 
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different kernel could be used for different data sets using the technique demonstrated in 



(Melkumyan and Ramos 2011) (for stationary kernel) or the convolution process technique 



demonstrated in (Higdon, 2002 Boyle and Frean, 2004 Vasudevan et al. , 2010b 2011) and 



in this paper (for both stationary and nonstationary kernel) . The covariance matrix of the 
training data is given by 



K{X,X) 



where 



-^11 

^21 



^12 



nt 



^nt nt 



K 



Y 



Kli{X„X,) + at I 

Ky^{Xi, Xj) . 



Here, represents the auto-covariance of the i^^ data set with itself and K^j represents 
the cross covariance between the i*^ and j*'^ data sets. These terms model the covariance be- 
tween the noisy observed data points {z values). Thus, they also take the noise components 
of the individual data sets / GPs into consideration. The corresponding noise free terms 
are respectively given by and K^j. These are derived by using the process convolution 
approach to formulating Gaussian processes; details of this follow in the next subsection. 
The covariance matrix between the test points and training points is given by 



K{X,,X) = [Kli{X,,Xi),Kg{X,,X2),...,Ky„t{X,,Xnt)] , 

where ie{l ... nt} is the GP that is being evaluated given all other GPs. The matrix 
K{X, X=k) is defined likewise. Finally, the covariance of the test points is given by 

K{X,,X,) = Ky,iX„X,)+afl , 

assuming the GP needs to be evaluated for the particular test point. The mean and 
variance of the concentration estimate can thus be obtained by applying Equations [5] and 
[6| after incorporating multiple outputs/tasks, multiple GP/noise hyperparameters and de- 
riving appropriate auto and cross covariances functions that model the spatial correlation 
between the individual data sets. Data fusion is thus achieved in the MTGP approach by 
correlating individual heterogeneous outputs/tasks and using this correlation information 
to improve the prediction estimates of each of them. 



3.3 Derivation of the auto and cross covariance terms 

The main challenge in the use of multi-task GPs is the derivation of closed form cross (and 
auto) covariance functions. The process convolution approach to modeling GPs, proposed 
in ( [Higdon 2002), can address this problem. The cited paper (1) modeled a GP as the 
convolution of a "smoothing kernel" and a Gaussian white noise process, (2) expressed 
a relationship between the "smoothing kernel" and the corresponding covariance function 
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through the Fourier transform, (3) noted that for stationary isotropic kernels, there ex- 
isted a one-to-one relationship between the covariance function and its smoothing kernel 
and that for non- isotropic and/or non-stationary kernels, there was no unique solution to 
the smoothing kernel and (4) hinted at how this approach may be used to develop GP 
models with complex properties (e.g. nonstationarity) . As a consequence of this approach, 
modeling the GP amounted to modeling the hyperparameters of the smoothing kernel. For 
the second point above, the paper suggested that the smoothing kernel for a covariance 
function could be obtained as the Inverse Fourier Transform of the square root of the spec- 
trum (Fourier transform) of the covariance function. The process convolution approach 



to MTGPs has been used with the stationary SQEXP kernel in (Boyle and Frean 2004 



Alvarez and Lawrence 



( Vasudevan et al. 



2011 



2009 Vasudevan et al. 2010b) and the nonstationary NN kernel in 



Vasudevan 2012 ). Once the smoothing kernel is identified for a co- 



variance function, the cross-covariance between two covariance functions can be derived as 



a kernel correlation between the respective smoothing kernels (see Boyle and Frean 2004 ) . 



The following mathematical formalism is based on (Higdon, 2002) and (Boyle and Frean 



2004). 



Yi{s) = Ui{s) + Wi{s) 
Ui(s) = I ki{s,X) * X{X) dX 



(8) 
(9) 



Kii{Sa,Sb) 



E{U,{Sa)U,{sb)} 

E' I y ki{sa,a).X{a)da 
ki{sa,a) kj{sb,a) da 
ki{sa,a) ki{sb,a) da 



k,isb,P).Xif3)d(3 



(10) 
(11) 



Mathematically, if Yi{s) represents the observed data in Equation [8| it is expressed 
as a combination of a noise- free GP Ui{s) and Gaussian white noise process Wi{s). The 
GP Ui{s) is further modeled as a convolution of a smoothing kernel ki{s, X) and a Gaussian 
white noise process ^(A), as shown in Equation [9| A stationary and/or isotropic smoothing 
kernel would take the form ki{s — A) as it would be a function of the distance between the 
input points. If two covariance functions (corresponding to two GPs Ui{s) and Uj{s)) have 
smoothing kernels ki{sa, X) and kj{sb,X) respectively, then the cross covariance between 
them can be derived as shown in Equation 10 The auto covariance can be deduced from 



the cross covariance expression and take the form shown in Equation 11 

ki{xa, a) p da 



kernel ki and kj need to be finite energy kernels i.e. J 



The smoothing 
< oo. This can be 



intrinsically true of some kernel (e.g. squared exponential kernel) or can be true subject to 
the bounded application of the kernel (e.g. neural network kernel). 



The work (Melkumyan and Ramos, 2011) suggested that if a covariance function could 



be written as a convolution of its "basis functions" (the form specified in Equation 11), then 
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a cross-covariance between two covariance functions could be derived as a kernel correlation 
of their respective basis functions (the form specified in Equation 10). The paper proved 
that the resulting cross-covariance would be positive definite. In order to find the basis 
function for a particular covariance function, the paper derived an expression in terms of 
its Fourier transform. This relationship is identical to that suggested by (Higdon, 2002) 
and valid for stationary kernels only. The paper also derives closed form cross-covariance 
functions for different combinations of stationary kernels including the squared exponential, 



Matern 3/2 and a sparse covariance function developed by the authors in (Melkumyan and 



Ramos, 2009). 



This paper argues that both of these methods using the "smoothing kernel" (Higdon 



2002) and the "basis functions" (Melkumyan and Ramos, 2011 ) are actually equivalent with 



the former providing a sound basis to explain the latter as well as a powerful framework 
to develop other complex GP models such as space-time models and nonstationary GPs. 
The key insight obtained here is in the methodology of identifying the smoothing kernel 
for the process convolution approach. If the covariance function is a stationary kernel, 
there is an exact one-to-one relationship between the covariance function and the smoothing 



kernel as pointed out in ( Higdon , 2002 ) and whose expression is derived in ( Melkumyan and 
Ramos, 2011 ). If the covariance function is nonstationary, several possible smoothing kernels 

However, 



2002) 



may lead to the same covariance function, as pointed out in (Higdon 
attempting to express the kernel in a separable form (e.g. as the correlation of two identically 
formed basis functions) and thereby identifying the smoothing kernel would be one possible 
approach, if the form of the kernel form allowed for such separation. Needless to say, this idea 
would be applicable only in a restricted class of covariance functions and finding a universal 
approach to identifying the smoothing kernel for other nonstationary kernel remains an 
open question. Given the smoothing kernel of the covariance functions in consideration, the 



2002t Boyle and Frean 


Ramos 




2011). 



derived as a kernel correlation as demonstrated in (Higdon 


Vasudevan et al. , 


2011 


Vasudevan 


2012; jMelkumyan and 



Assume two GPs N{0,ki) and N{0,kj), with with length scale matrices Sj and 
Based on (Boyle and Frean, 2004[ ), the cross and auto covariances for the stationary SQEXP 
kernel are given by Equations |12| and 13 respectively. The corresponding expressions for the 
nonstationary NN kernel are derived in (Vasudevan et al. , 2011 ; Vasudevan, 2012 ) and given 



in Equations 14 and 15 respectively. For the Matern 3/2 kernel, the expressions for the cross 



covariance and auto covariance are derived in (Melkumyan and Ramos, 2011) and given in 



Equations 16 and 17 respectively. Also based on (Melkumyan and Ramos, 2011), the cross 



covariance function between an SQEXP and a Matern 3/2 kernel is given by Equation 18 



Kij{x,x) = Kf(i,j) — ^— ^ exp ( -^(x - x')^Sjj(x - x') 



(12) 



where 



— Sj(Sj -|- Sj) Yjj — Sj(Sj -|- Sj) S 

1 



K\',{x,x') = Kf{i,i) 



, expi— -(x — x') $]j(x — x') 
S,:|2 V 4 



(13) 
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I - 1 I - 



(14) 



where 



— 2 Sj (Sj + Sj) Yij 



K^i{x,x') = Kf{i,i) A:ArAr(x,x', Si) 



(15) 



2/2/2 

K^^{x,x') = KJ{^,J) n 7^ 



■jk 



(16) 



where k e 1 ... d is the dimension of the input data, /j and Ij are the length scales for the two 
Matern 3/2 kernel based GPs i and j, lik and Ijk are the /c*'* length scales (corresponding 
to the k''^ dimensions) of these GPs and = Ij;^ — is the distance in the A;*^ dimension 
between the input data. 



Kfi{x,x') = Kf{i,i) kMATERN3{^,x',T.i) 



(17) 



K^j{x,x') 



K 



l<k<d 



2cosh(^)- 

y 'Mfc j 



e 'Affc erf ( Afc + 



SEk 



■ T 



- t ( \ '^k 

Mk erf Afc 



(18) 



where = ^'f^, erf (x) = ^ Jq e *^dt, /c e 1 . . . d is the dimension of the input data, 
IsE and /m are the respective length scales for the SQEXP and Matern 3/2 kernel based 
GPs i and j, IsEk and Imu are the k''^ length scales (corresponding to the k^^ dimensions) 
of these GPs and = \xk — x'^\ is the distance in the k^^ dimension between the input 
data. 



In Equations 14 and 15, the term, ki\f]sf{x,x' , T,ij), is the NN kernel for two datax, x' and 



length scale matrix Sjj. It is given by Equation 2l excluding the signal variance term (a 



ters 



Likewise, in Equation 17 kMATERNi^-, x', Sj) refers to the Matern 3/2 kernel for two data x, 
x' and length scale matrix Sjj, given by Equati on jsj (excluding the CTj term). The Kj terms 



in Equations 12, 13, 14 and 15 are inspired by (Bonilla et al. , 2007[ ). This term models the 



task similarity between individual tasks. Incorporating it in the auto and cross covariances 
provides additional flexibility to the multi-task GP modeling process. It is a symmetric 
matrix of size nt x nt and is learnt along with the other GP hyperparameters. Thus, the 
hyperparameters of the system that need to be learnt include {nt.{nt + l))/2 task similarity 
values, nt .2 or nt .3 length scale values respectively for the individual SQEXP/MATERN3 
or NN kernels and nt noise values corresponding to the noise in the observed data sets. 
Learning these hyperparameters by adapting the GP learning procedure described before 
(Equation [T]) for multiple outputs/tasks ( |Vasudevan et al. 2010b, 2011). 
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An alternative generalized multi-task GP formulation accomodating input dependent 
signal and noise correlations, input dependent length scales and amplitudes, and heavy 
tailed predictive distributions between multiple output variables has recently been sug- 



gested in (Wilson et al. , 2012). The formulation combines the structural properties of 
neural networks with the nonparametric flexibility of GPs; it does not require the explicit 
computation of closed form expressions for auto and cross covariances. 

4. Experiments 

Experiments were conducted on a large scale geological resource data set made up of real 
sensor data. The data consists of 63,667 measurements from a 3478.4 m x 1764.6 m x 
345.9 m region in Australia that has undergone drilling and chemical assays to determine 
its composition. The holes are generally 25-lOOm apart and tens to hundreds of meters 
deep. Within each hole, data is collected at an interval of 2m. The measurements include 
the (east, north, depth) position data along with the concentrations of three elements, 
Element-1, Element-2 and Element-3, hereafter denoted as El, E2 and E3 respectively. 
These three quantities are known to be correlated and hence the objective is to use each of 
their GP models to improve the others' prediction estimates by capturing the correlation 
between these quantities. The data set is shown in Figure [2j The methodology of testing is 



described in Section 4.1 Multiple metrics have been used to evaluate the methods, these 



are described in Section 4.2 Results obtained are then presented and discussed in Section 



4.3 Outputs of the data fusion process provided by the best performing model as suggested 



by the evaluation are also presented. 



4.1 Testing procedure 



The objective of the experiment was to compare the multi-task GP approach with a con- 
ventional GP approach and quantify if the data fusion in the MTGP actually improves 
estimation. A second objective of the experiments was to compare the nonstationary NN 
kernel with the stationary SQEXP kernel, the Matern 3/2 kernel and a combination of them 



that proved effective in prior testing (Melkumyan and Ramos, 2011 ). Towards these aims, a 



ten fold cross validation experiment was performed on the data set, with each of the kernels. 



This was motivated by the work (Kohavi, 1995), which suggests a ten fold stratified (similar 



number of samples in each fold) cross validation as the best way of testing the estimation 
accuracy of machine learning methods on real world data sets. 

The MTGP and simple GP approaches each require an optimization step for model 
learning. The optimization step in each method can result in different local minima in 
each trial (and with each kernel). Thus, to do a one-on-one comparison between the two 
approaches and quantify their relative performances, an exact comparison is required. The 
benchmarking experiment presented in this paper provides an exact comparison between 
the MTGP and GP approaches. To do this, 

• The best available MTGP parameters were found for each kernel. From this, appro- 
priate subsets of the parameters were chosen for the GP approach. 

• The approaches were compared on identical test points and identical training/evaluation 
points selected for each of the test points. 
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• It is also necessary that the covariance function for the simple GP approach must be 
identical to the auto-covariance function of the DGP approach. For this reason, the 
auto-covariance function (for both kernels) is used as the covariance function for the 
GP approach to data fusion. 



In addition to this, three independent GPs (denoted as GPI here after) were optimized 
for El, E2 and E3 and their estimates for the same set of test points were also compared. 
Thus the effect of information integration in the context of the geological resource modeling 
can be seen in terms of both an exact comparison (MTGP vs GP) and an independent 
comparison (MTGP vs GPI). 

For the cross validation, a "block" sampling technique was used, a 3D version of the 
"patch" sampling method used in (Vasudevan et al. 2009) (see Figure [3]). The idea was 
that rather than selecting test points uniformly, blocks of data test the robustness of the 
approach better as the support points to the query point are situated farther away than in 
uniform point selection. The data set is gridded into blocks of different sizes. Collections 
of blocks represent individual folds. In each cross validation test, one fold was designated 
as a test fold and points from it were used exclusively for testing. All other folds together 
constituted the evaluation data, a small subset of which were labeled as the training data. 
Note that this technique of testing will naturally lead to larger errors. For the test fold, 
the El, E2 and E3 concentrations (and error metrics defined in the following section) are 
estimated first using the MTGP approach, then with the GP approach using parameters 
from the optimized MTGP parameters and finally, with an independently optimized GP for 
each of the three quantities. The result of a 10 fold cross validation test is a 63,667 point 
evaluation in tougher test conditions than what would be attainable with uniform sampling 
(e.g. every tenth point) of test points. 

Block sizes were chosen empirically, in proportion (arbitrarily rounded up or down) 
to the dimensions of the whole data set and with a view of performing a stratified cross 
validation test. The block sizes chosen and the resulting implications on the cross validation 
testing are shown in Table [T] The smaller block size of 22m x 11m x 2m results in each fold 
having a similar number of points (i.e. numbers of points in folds with min/max test points 
are similar) and thus results in the most stratified cross validation test. With increasing 
block size, prediction error increases (support data is farther away), stratification is reduced 
and hence, variance in prediction error also increases. Uniform sampling of test points may 
be considered as a limiting case of block sampling with the smallest block size possible. 



4.2 Metrics 

Multiple metrics have been used to understand the various methods being tested. They are 
briefly described below. These are evaluated for each test point in each fold of the cross 
validation test. The result would then be represented by the mean and standard deviations 
of all values across all folds. 



1. Squared Error (SE): This represents the squared difference between the predicted 
concentration and the known concentrations for the set of test points. The mean over 
the set of all test points (Mean Squared Error or MSE) is the most popular metric 
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for the context of this paper. Referring Equations [H] and [6| for the i^^ test point, 

SE{{) = {h{i)-Zif 

Variance (VAR): This represents the variance (uncertainty) in the predicted concen- 
trations for the set of test points, a lower VAR is a good outcome, only if the SE 
is also low. A model that has high SE and low VAR would be a poor model as this 
result would suggest that the model is confident of its inaccurate estimates. A better 
outcome would be a model with high SE and correspondingly high VAR i.e. a model 
that has inaccurate predictions but is also uncertain about these predictions. 



3. Negative log probability / Log loss (NLP): Inspired by (Rasmussen and Williams , 2006 ) 
(see page 23), this is a measure of the extent to which the model (including the GP 
model, kernel, parameters and evaluation data) explain the current test point. The 
lower the value of this metric, the better the model. For the i^^ test point, 



2 *' 2a4i) 

4.3 Results 



Table 1: 10 fold cross validation with block sampling; 63667 points in data set spread over 
3478.4 m x 1764.6 m x 345.9 m; block sizes tested vs relative implications on 
results 



Block size 


Number of points 


Number of points 


Comments on 


(m) 


in fold with MIN 
test points 


in fold with MAX 
test points 


cross validation test 


22 X 11 X 2 


6209 


6454 


Most stratified cross validation 
Least prediction error 


44 X 22 X 4 


6183 


6456 


stratification J, prediction error f 


87 X 45 X 9 


5807 


6739 


stratification J, prediction error t 


174 X 89 X 18 


5133 


7549 


stratification J, prediction error f 


348 X 177 X 35 


4976 


9662 


stratification J, prediction error f 


696 X 353 X 70 


1204 


10371 


Least Stratified cross validation 
Highest prediction error 
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i: 




I 



I 



(a) Element- 1 concentration 




(b) Element-2 concentration 




(c) Element-3 concentration 



Figure 2: The geological resource data set. Figures [2(a) 2(b) and 2(c) respectively show 



the concentrations of three elements over the region of interest. The central region 
of points is surrounded by sparse sets of points which are not pre-filtered when 
applying the proposed algorithm. 
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Figure 3: Example of 3D block sampling of a geological resource data set. Blocks may 
be sampled of different sizes. The red and yellow blocks represent blocks from 
two of the ten folds used in cross validation testing. Test points within these 
blocks have "support" data away from them, outside the blocks. This sampling 
method is therefore a stronger test of the robustness of an approach to estimating 
the quantity of interest, as compared to uniformly sampling test points. The 
estimation errors however, will be higher than that obtained for a uniformly 
sampled set of points. 



19 



Vasudevan, Melkumyan and Scheding 



MTGP Squared Error (SE) 




Block size -> 
1200 



1000 



800 



GOO 



400 
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B C E 

MTGP Squared Error (SE) range 



SQEXP 
MS 



JQ 



Block size -> A 



Figure 4: Element El, MTGP approach, SE metric. The figure above shows the average 
values; the one below shows the range of values obtained considering two standard 
deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), B (44 x 22 x 
4), C (84 X 45 X 9), D (174 x 89 x 18), E (348 x 177 x 35) and F (696 x 353 x 70). 
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Figure 5: Element El, MTGP approach, VAR metric. The figure above shows the average 
values; the one below shows the range of values obtained considering two standard 
deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), B (44 x 22 x 
4), C (84 X 45 X 9), D (174 x 89 x 18), E (348 x 177 x 35) and F (696 x 353 x 70). 
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Figure 6: Element El, MTGP approach, NLP metric. The figm'e above shows the average 
values; the one below shows the range of values obtained considering two standard 
deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), B (44 x 22 x 
4), C (84 X 45 X 9), D (174 x 89 x 18), E (348 x 177 x 35) and F (696 x 353 x 70). 
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MTGP vs GP vs GPI (NN kernel) Squared Error (SE) 
350 I 1 1 1 1 




Block size -> A B C 1) E F 



MTGP vs GP vs GPI (NN kernel) Squared Error (SE) range 
1500 I 1 1 1 1 1 r- 




Block size -> A B C D E F 



Figure 7: Element El, MTGP vs GP vs GPI approaches, SE metric. The figure above shows 
the average values; the one below shows the range of values obtained considering 
two standard deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), 
B (44 X 22 X 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 x 35) and F 
(696 X 353 X 70). 
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4.5 



MTGP vs GP vs GPI (NN kernel) Negative L09 Probability (NLP) 
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Block size -> A B C B E F 

MTGP vs GP vs GPI (NN kernel) Negative L09 Probability (NLP) 




Block size -> A 



Figure 8: Element El, MTGP vs GP vs GPI approaches, NLP metric. The figure above 
shows the average values; the one below shows the range of values obtained con- 
sidering two standard deviations about the mean. Test block sizes (m) - A (22 x 
11 X 2), B (44 X 22 X 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 x 35) 
and F (696 x 353 x 70). 
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MTGP-MM vs GPI- 



vs MTGP-SQEXP vs GPI-SQEXP Squared Error (SE) 
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MTGP-MM vs GPI-MM vs MTGP-SQEXP vs GPI-SQEXP Squared Error (SE) 
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Block size -> A 



Figure 9: Element El, MTGP vs GPI approaches, MM and SQEXP kernels, SE metric. 

The figure above shows the average values; the one below shows the range of 
values obtained considering two standard deviations about the mean. Test block 
sizes (m) - A (22 x 11 x 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), 
E (348 X 177 X 35) and F (696 x 353 x 70). 
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MTGP-MM vs GPI-MM vs MTGP-SQEXP vs GPI-SQEXP Negative L09 probability (NLP) 
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MTGP-MM vs GPl-MM vs MTGP-SQEXP vs GPI-SQEXP Negative Log probability (NLP) 




Block size -> Pi 



Figure 10: Element El, MTGP vs GPI approaches, MM and SQEXP kernels, NLP metric. 

The figure above shows the average values; the one below shows the range of 
values obtained considering two standard deviations about the mean. Test block 
sizes (m) - A (22 x 11 x 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), 
E (348 X 177 X 35) and F (696 x 353 x 70). 
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Figure 11: Element El, MTGP vs GP vs GPI approaches, MS kernel, SE metric. The 
figure above shows the average values; the one below shows the range of values 
obtained considering two standard deviations about the mean. Test block sizes 
(m) - A (22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E 
(348 X 177 x 35) and F (696 x 353 x 70). 
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MTGP vs GP vs GPI (MS kernel) Negative L09 Probability (NLP) 
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Figure 12: Element El, MTGP vs GP vs GPI approaches, MS kernel, NLP metric. The 
figure above shows the average values; the one below shows the range of values 
obtained considering two standard deviations about the mean. Test block sizes 
(m) - A (22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E 
(348 X 177 x 35) and F (696 x 353 x 70). 
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Figure 13: Element E2, MTGP approach, SE metric. The figure above shows the average 
values; the one below shows the range of values obtained considering two stan- 
dard deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), B (44 
X 22 X 4), C (84 X 45 X 9), D (174 x 89 x 18), E (348 x 177 x 35) and F (696 x 
353 X 70). 
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Figure 14: Element E2, MTGP approach, VAR metric. The figure above shows the aver- 
age values; the one below shows the range of values obtained considering two 
standard deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), B 
(44 X 22 X 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 x 35) and F (696 
X 353 X 70). 
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Figure 15: Element E2, MTGP approach, NLP metric. The figure above shows the aver- 
age values, the right image being a zoomed in view of the left; the one below 
shows the range of values obtained considering two standard deviations about 
the mean. Test block sizes (m) - A (22 x 11 x 2), B (44 x 22 x 4), C (84 x 45 x 
9), D (174 X 89 X 18), E (348 x 177 x 35) and F (696 x 353 x 70). 
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MTGP vs GP vs GPI (NN kernel) Squared Error (SE) 
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Figure 16: Element E2, MTGP vs GP vs GPI approaches, SE metric. The figure above 
shows the average values; the one below shows the range of values obtained 
considering two standard deviations about the mean. Test block sizes (m) - A 
(22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 
X 35) and F (696 x 353 x 70). 
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MTGP vs GP vs GPI (NN kernel) Negative L09 Probability (NLP) 
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Figure 17: Element E2, MTGP vs GP vs GPI approaches, NLP metric. The figure above 
shows the average values; the one below shows the range of values obtained 
considering two standard deviations about the mean. Test block sizes (m) - A 
(22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 
X 35) and F (696 x 353 x 70). 
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Figure 18: Element E2, MTGP vs GPI approaches, MM and SQEXP kernels, SE metric. 

The figure above shows the average values; the one below shows the range of 
values obtained considering two standard deviations about the mean. Test block 
sizes (m) - A (22 x 11 x 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), 
E (348 X 177 X 35) and F (696 x 353 x 70). 
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Figure 19: Element E2, MTGP vs GPI approaches, MM and SQEXP kernels, NLP metric. 

The figure above shows the average values, the right image being a zoomed in 
view of the left; the one below shows the range of values obtained considering 
two standard deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), 
B (44 X 22 X 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 x 35) and F 
(696 X 353 X 70). 
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Figure 20: Element E2, MTGP vs GP vs GPI approaches, MS kernel, SE metric. The 
figure above shows the average values; the one below shows the range of values 
obtained considering two standard deviations about the mean. Test block sizes 
(m) - A (22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E 
(348 X 177 x 35) and F (696 x 353 x 70). 
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MTGP vs GP vs GPI (MS kernel) Negative Log Probability (NLP) 
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Figure 21: Element E2, MTGP vs GP vs GPI approaches, MS kernel, NLP metric. The 
figure above shows the average values; the one below shows the range of values 
obtained considering two standard deviations about the mean. Test block sizes 
(m) - A (22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E 
(348 X 177 x 35) and F (696 x 353 x 70). 
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Figure 22: Element E3, MTGP approach, SE metric. The figure above shows the average 
values; the one below shows the range of values obtained considering two stan- 
dard deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), B (44 
X 22 X 4), C (84 X 45 X 9), D (174 x 89 x 18), E (348 x 177 x 35) and F (696 x 
353 X 70). 
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Figure 23: Element E3, MTGP approach, VAR metric. The figure above shows the aver- 
age values; the one below shows the range of values obtained considering two 
standard deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), B 
(44 X 22 X 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 x 35) and F (696 
X 353 X 70). 
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Figure 24: Element E3, MTGP approach, NLP metric. The figure above shows the average 
values; the one below shows the range of values obtained considering two stan- 
dard deviations about the mean. Test block sizes (m) - A (22 x 11 x 2), B (44 
X 22 X 4), C (84 X 45 X 9), D (174 x 89 x 18), E (348 x 177 x 35) and F (696 x 
353 X 70). 
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MTGP vs GP vs GPI (NN kernel) Squared Error (SE) 
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Figure 25: Element E3, MTGP vs GP vs GPI approaches, SE metric. The figure above 
shows the average values; the one below shows the range of values obtained 
considering two standard deviations about the mean. Test block sizes (m) - A 
(22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 
X 35) and F (696 x 353 x 70). 
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MTGP vs GP vs GPI (NN kernel) Negative L09 Probability (NLP) 
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Figure 26: Element E3, MTGP vs GP vs GPI approaches, NLP metric. The figure above 
shows the average values; the one below shows the range of values obtained 
considering two standard deviations about the mean. Test block sizes (m) - A 
(22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E (348 x 177 
X 35) and F (696 x 353 x 70). 
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MTGP-MM vs GPI-MM vs MTGP-SQEXP vs GPI-SQEXP Squared Error (SE) 




Block size -> A B C B E F 

MTGP-MM vs GPI-MM vs MTGP-SQEXP vs GPI-SQEXP Squared Error (SE) 
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Figure 27: Element E3, MTGP vs GPI approaches, MM and SQEXP kernels, SE metric. 

The figure above shows the average values; the one below shows the range of 
values obtained considering two standard deviations about the mean. Test block 
sizes (m) - A (22 x 11 x 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), 
E (348 X 177 X 35) and F (696 x 353 x 70). 
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Figure 28: Element E3, MTGP vs GPI approaches, MM and SQEXP kernels, NLP metric. 

The figure above shows the average values; the one below shows the range of 
values obtained considering two standard deviations about the mean. Test block 
sizes (m) - A (22 x 11 x 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), 
E (348 X 177 X 35) and F (696 x 353 x 70). 
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Figure 29: Element E3, MTGP vs GP vs GPI approaches, MS kernel, SE metric. The 
figure above shows the average values; the one below shows the range of values 
obtained considering two standard deviations about the mean. Test block sizes 
(m) - A (22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E 
(348 X 177 x 35) and F (696 x 353 x 70). 
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0' L. 

Block size -> Pi 



Figure 30: Element E3, MTGP vs GP vs GPI approaches, MS kernel, NLP metric. The 
figure above shows the average values; the one below shows the range of values 
obtained considering two standard deviations about the mean. Test block sizes 
(m) - A (22 X 11 X 2), B (44 x 22 x 4), C (84 x 45 x 9), D (174 x 89 x 18), E 
(348 X 177 x 35) and F (696 x 353 x 70). 
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i 



(c) Uncertainty in El predictions constituting the 2D section view 



Figure 31: Figures 31(a)[ 31(b) and 31(c) respectively show the predicted El concentrations 
(over the entire region) superimposed with the input data, a 2D section-view of 
the output data and the uncertainty in the predicted concentrations for the 2D 
view. Expectedly, the uncertainty is low around regions where input/given data 
exist and rapidly rises for predictions away from such areas - typically, the fringe 
areas. The 2D section view shows two red regions corresponding two regions of 
high El concentration. The corresponding regions in Figures 32 and 33 show 
low E2 and E3 concentrations respectively. 



Figures 31, 32 and 33 show the predicted concentrations of El, E2 and E3 over the 
entire region of interest as well as 2D section views of this output and the uncertainty of the 
predictions that constitute it; these were produced using multi-task GPs using the Neural 



47 



Vasudevan, Melkumyan and Scheding 




Figure 32: Figures 32(a)[ 32(b) and 32(c) respectively show the predicted E2 concentrations 
(over the entire region) superimposed with the input data, a 2D section-view of 
the output data and the uncertainty in the predicted concentrations for the 2D 
view. Expectedly, the uncertainty is low around regions where input/given data 
exist and rapidly rises for predictions away from such areas - typically, the fringe 
areas. The 2D section view shows two violet regions corresponding two regions 
of low E2 concentration. The corresponding regions in Figure |3l] show high El 
concentration and those from Figure |33] show low E3 concentration. 
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Figure 33: Figures 33(a)[[33(b)| and 33(c) respectively show the predicted E3 concentrations 
(over the entire region) superimposed with the input data, a 2D section-view of 
the output data and the uncertainty in the predicted concentrations for the 2D 
view. Expectedly, the uncertainty is low around regions where input/given data 
exist and rapidly rises for predictions away from such areas - typically, the fringe 
areas. The 2D section view shows two violet regions corresponding two regions 
of low E3 concentration. The corresponding regions in Figure [31] show high El 
concentration and those from Figure [32] show low E2 concentration. 
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Network kernel. Tables [2| [3] and |4] show the results of the cross validation testing on the 
geological resource data set with the Neural Network (NN), Matern 3/2 (MM), Squared 
Exponential (SQEXP) and Matern 3/2 - Matern 3/2 - Squared exponential (MS) kernels. 
Figures [4| through 1 1 2| depict the main results of Table [2] (element El), Figures 13 through 21 



depict the main results of Table [3] (element E2) and Figures [22| through 30 depict the main 
results of Table [4] (element E3). The following observations were made from the results 
obtained. 



Prediction error (SE) increases with increase in block size. This is because the support 
training data required for regressing at a test point is situated farther away. Increasing 
the test block size also results in reduced stratification as one fold of the cross vali- 
dation may have e.g. 10,000 test points whereas another may have only 1000 points. 
This results in increased standard deviation of prediction error. A ten fold stratified 
cross validation is generally considered to be the most representative of performance 
measure ( [Kohavi 1995), however testing multiple larger block sizes provides a better 
understanding of the model's behavior and robustness, (see Tables l2| ^ and Hi Figures 



ll [71 |9l [TT] for El , Figures [Tsl [Tel [Tsl [20] for E2 and Figures |22l [25 127) [29) for E3 



Further optimization of each of these cases could yield better results. The results 
shown are the result of a reasonable amount of optimization applied to each kernel 
and GP model. Typically, multiple attempts were performed and the best results ob- 
tained were pursued/used. One iteration consisted of either a stochastic optimization 
step (simulated annealing) and/or a gradient based optimization step (Quasi Newton 
optimization with BFGS Hessian update) with 10,000 training data chosen uniformly 



from the data. This work uses a "block- learning" approximation (Vasudevan et al. 



2010b) which approximates the total marginal likelihood as a sum of a sequence of 



marginal likelihoods computed over blocks of points comprising the training data. The 
size of the block is defined by the computational resources available. The stochastic 
optimization step was the most time consuming part; each attempt was started with 
completely random parameters. The code was unoptimized MATLAB code running 
typically on an 8-core processor based machine. Most times, not all the cores were 
used for the same process; multiple processes also shared the same system. Note that 
the experiments in this paper do not use analytical gradients for the optimization of 
the hyperparameters; this was a design choice made in the interest of stability and 
comparability of the optimization results across kernels. The use of analytical gra- 
dients can significantly reduce the total training time. Training time may also be 
reduced significantly by various other ways including other approximations, intelli- 
gently setting initial parameters, scaling the data etc. 
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Model 


Kernel 


Number of training attempts, iterations 
Total training time for successful attempt 


MTGP 


ATAT 

IN IN 
MM 
SQEXP 
MS 


2 attempts, 6 iterations, total training time = 78.89 hours 

3 attempts, 2 iterations, total training time = 222.15 hours 

4 attempts, 1 iteration, total training time = 92.52 hours 
2 attempts, 3.5 iterations, 3 iterations took 113.69 hours 


GPI 


NN 

MM 

SQEXP 


3 attempts, 2 iterations, total training time = 41.07 hours 
2 attempts, 1 iteration, total training time = 48.91 hours 
2 attempts, 1 iteration, total training time = 47.67 hours 



Rather than the individual training times, the relative amount of training (under simi- 
lar conditions, with different kernel) required to produce a reasonable set of parameters 
is of more interest. Experience suggests that the NN kernel based MTGP/GP models 
converged faster and better as compared to other kernels. 

Note that the observations below are general summaries of trends. Exceptional cases 
(e.g. the observed trend is not violated in two cases for E3) are attributed to the need 
for further optimization for a subset of the model parameters (those of E3), unless 
otherwise specified. 

MTGP-NN vs MTGP-MM: The NN kernel is the best performing kernel of the four 
tested, across all block sizes tested. For small block sizes, both the NN and MM 
kernel are competitive. Note however that the MM kernel produces lower VAR for a 
higher SE, meaning that it is more confident of its SE values which are worse/higher 
than those of the NN kernel. This makes its NLP higher and the model poorer than 
an MTGP based on the NN kernel. Note also that as the test block size increases, 
the advantage in performance of the MTGP based on the NN kernel over that based 
on the MM kernel becomes more distinctive. Not only are the SE values smaller for 
the NN kernel, the NLP values remain in the same range whereas those of the MM 
kernel rise significantly. This proves that the MTGP-NN is better performing and 
more robust than the MTGP-MM. The latter property suggests that the MTGP-NN 
will be able to cope better with incomplete data sets, (see Tables [2| [sj |4| Figures |4| 
[51 [el for El, Figures pi pi flsl for E2 and Figures l22l 121 [241 for E3.) 



• MTGP vs GP vs GPI: For the NN kernel, the MTGP metrics are always lower than 
the corresponding derived GP (GP) or independent GP (GPI) metrics - lower SE (bet- 
ter estimate) with reduced VAR (less uncertainty) and reduced NLP (better model) is 
produced. This clearly demonstrates the benefits of information fusion across hetero- 
geneous information sources so as to improve individual predictions using the MTGP 
model. For the MM and MS kernels, this desired trend is observed for smaller block 
sizes. For larger block sizes, the increase in SE (error) is not met with an appropri- 
ate increase in VAR (prediction uncertainty). As a result, even though the MTGP 
model produces smaller errors than the GP/GPI models, the latter would probably 
be a better choice with respect to the NLP metric. The SQEXP kernel is discussed 
separately below, (see Tables [2| [sj [§ Figures [Tj [s) [l6j [TTj [isj [26| for the NN kernel, 
Figures [H} [l2[ [20[ [211 [29[ [30| for the kernel and Figures [9} [lO[ [l8[ [T9[ [27} [gH for 



the MM kernel.) 
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MTGP-SQEXP: The MTGP-SQEXP model performs poorly in comparison with the 
equivalent models using the NN/MM kernels, with respect to both SE and NLP. 
For elements El and E3, the MTGP-SQEXP has a better SE than the correspond- 
ing model based on the MS kernel; it has an SE better than the corresponding de- 
rived/independent GP models but an inferior (overconfident) VAR and a fluctuating 
NLP trend. For element E2, the MTGP-SQEXP is worse off than both the equivalent 
model based on the MS kernel as well as its corresponding GP models, (see Tables [2 



3J [4J Figures § [TOKTT} [12] for El, Figures [T8j [Tg} [20j [21] for E2 and Figures [27] [28] [29 
^for E3.) 

• MTGP-SQEXP: Considering the results for E2, the NLP is directly proportional to 
the SE and inversely to the prediction variance. At the smallest block size, the MTGP- 
SQEXP produces relatively high SE (with respect to e.g. MTGP-NN) but very low 
prediction variance. This basically suggests that the model is confident of its poor 
estimates - a bad outcome. This results in a high NLP and poor model. As the block 
size increases, the prediction variance increases more relative to the prediction error 
resulting in the decreasing NLP trend. For elements El and E3, the largest block size 
results in a stronger increase in prediction error than the variance in the prediction 
resulting in an increase in NLP. Overall, the MTGP-SQEXP model is poor, (see 



Tables [2] [§ [4] Figures [4] [5| [6] for El, Figures [13] [14] [15] for E2 and Figures [22] [23] [24] 
for E3.) 

• MTGP-SQEXP: The SQEXP kernel is a limiting case of the MM kernel; both are sta- 
tionary kernels. Considering the behavior of the GPI model using the SQEXP kernel 
and its competitive results with respect to those of the GPI-MM kernel, it is likely 
that the poor performance of the MTGP-SQEXP is likely due to poor optimization 
output (a poor local minima). Further investigation on this result is ongoing but the 
findings are not expected to change the conclusions of this paper, (see Tables [2] [s] [4] 
and Figures [o] [To] [Ts] [l9] [27] [28|) 

• The MS kernel is not competitive with respect to the NN and MM kernels. However, 
the MTGP using this kernel combination proves to be better than a derived GP 
and an independently optimized GP with respect to the SE metric. From the NLP 
perspective, the MTGP-MS model is more competitive than the other GP models for 
small block sizes. For larger block sizes, using an independently optimized GP proves 
to be a more trust worthy modeling option. The exception to this behavior is seen 
in the results for E3, the MTGP model is poor in this case. This is attributed to do 
with inferior parameters relevant to the element E3 obtained from the optimization 
process, (see Tables [2] [s] [4] and Figures [llj [l2] [20) [2l} [29] |30]) 

5. Discussion 

On the basis of this study, an attempt is made in answering two fundamental questions - (1) 
How can I know if I have a good MTGP model ? and (2) Which GP model or kernel should 
I use ? By no means is this intended to be a ready-made prescription, universal formula or 
short-cut to be used as a substitute for application specific and scientifically apt decisions in 
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developing Gaussian process models. Rather, this is a reflection of the authors' experiences 
based on the scope of this and past work in other domains such as terrain modeling. Note 
that there are numerous very sophisticated GP techniques (kernels, approximation etc.) 
which are beyond the scope of this work and which may change some of these inferences. 



5.1 How can I know if I have a good MTGP model ? 

In order to effectively develop and validate MTGP models, the experiments are suggestive 
of the following - 

1. The use of multiple kernel from the same family would provide a good method for 
validating the general behavior /trends of the model in question. For instance when 
developing an MTGP model based on the SQEXP kernel, developing a Matern 3/2 
kernel based MTGP model could provide a means to validate the behavior of the 
MTGP-SQEXP model. 

2. The model hyperparameter optimization performed in this paper is based on maximiz- 
ing the marginal likelihood. Typically, error metrics such as the SE being sufficiently 
low is suggestive of the model being good. A cross validation test could also be per- 
formed to ensure that this is indeed the case. However, it is also important to check 
if the model in question is under/over confident (high/low uncertainty) for a given 
level of error. This can be done, not as a standalone test, but in comparison with 
alternative models or test cases. 

3. When developing a MTGP model, it is a good idea to compare with an equivalent 
derived GP model and an independently optimized GP model. The availability of 
more information and the effective use of this information through the MTGP model 
should ideally result in significantly lower error metrics (e.g. SE) with a significant 
improvement in confidence (i.e. decrease in prediction variance, VAR) and a net 
reduction in the Negative Log Loss (NLP) metric. 

4. It may be useful to design a variety of different test cases (e.g. different test block 
sizes) and check if the performance metrics behave as expected. Such a test would 
also be indicative of the robustness of the model. 

5. It may be useful to optimize independent GP models for each task and use these 
hyperparameters as the initial parameters for the MTGP model. 



5.2 Which GP model or kernel should I use ? 



This obviously depends on the data set at hand and the constraints of the modeling prob- 
lem. The following are purely indicative, based on our experiences in multiple problem 



domains (Vasudevan et al. , 2009; Vasudevan 2012; Melkumyan and Ramos, 2011) and may 



change considering alternative kernels, other novel ways of treating the modeling problem 
or approximation methods. 

1. Time, complexity, computational resources are a premium. I need a method that just 
works: Independently optimized GP models using the Neural Network kernel or the 
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Matern 3/2 kernel would be a competitive solution. Note that the outcome will only be 
as good as the data being modeled and other information sources cannot be leveraged. 

2. / need the best possible model over a range of test sizes and I do not know much 
about my data: Multi-task GP models using the Neural Network kernel would be a 
competitive solution. 

3. / need the best possible model over a range of test sizes and I know how my data 
changes: Multi-task GP models with a kernel representative of the variation of the 
data e.g. a uniform variation (no sudden changes in trend) can be effectively modeled 
using the Matern 3/2 or Squared Exponential kernels. 

4. I need a model that can cope with sparse data and/or incomplete data sets: Neural net- 
work kernel based GP or MTGP models depending on the computational complexity 
constraints and model accuracy requirements. 

5. / have "good" multi- attribute data. I need to model this well and fast: Independent 
GP models for each of the attributes, using either a Neural Network kernel or some 
other kernel more suited to the data, would provide a competitive solution. The 
use of independent GP models will result in the ability to parallelize the modeling 
process and significantly reduce the possibility of poor models (poor local minima) as 
a consequence of a reduction in number of model parameters. Note that "good" here 
is application dependent but would certainly require being well sampled, not noisy 
and reasonably complete (no large gaps where other information modalities can be 
leveraged) . 

6. Conclusion 

This paper studied the problem of geological resource modeling using multi-task Gaussian 
processes (MTGPs) . The concentrations of three elements were modeled and predicted over 
a region of interest using the MTGP as well as individual Gaussian processes (GPs) for each 
of these quantities. The paper demonstrates that MTGPs perform significantly better than 
individual GPs at the modeling problem as they effectively integrate heterogeneous sources 
of information (concentrations of individual elements) to improve the individual predictions 
of each of them. The benefits of information integration using the MTGP as against inde- 
pendent GPs for the task of geological resource modeling have been quantified by a cross 
validation study that performed both an exact and an independent comparison between 
MTGPs and GPs. Multi-task Gaussian process models based on the Neural Network kernel 
was shown to be a competitive and robust option across a range of test block sizes. 
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